Summary

This document presents an analysis of Schizosaccharomyces pombe start codon usage and context, using the S. pombe 972h- translation estimates from Duncan and Mata 2017.

Load Packages

Here we go

Expression: RNA abundance and ribosome-protected-fragments

Load expression data

We remove the Sp mitochondrial genes here because their translation is not detected by this ribosome profiling protocol.

## # A tibble: 5,164 x 7
##    Gene           RNA    RPF RNA_noN RPF_noN    TE TE_noN
##    <chr>        <dbl>  <dbl>   <dbl>   <dbl> <dbl>  <dbl>
##  1 SPCC1223.02  5846. 20301.   3169.   9560.  3.47   3.02
##  2 SPBC32F12.11 4811. 19320.   2608.   6422.  4.02   2.46
##  3 SPAC26F1.06  4038. 15032.   1452.   3782.  3.72   2.60
##  4 SPBC26H8.01  2817. 13956.   1570.   7698.  4.95   4.90
##  5 SPBC1815.01  3548. 12258.   1109.   1691.  3.45   1.53
##  6 SPAC27E2.11c 8220. 11694.   9323.  20023.  1.42   2.15
##  7 SPCC13B11.01 2772. 10487.   1367.   3940.  3.78   2.88
##  8 SPBC14F5.04c 2795.  9607.    889.   1815.  3.44   2.04
##  9 SPBC19C2.07  4923.  9148.   3135.   4251.  1.86   1.36
## 10 SPAC1F8.07c  3399.  8582.   1333.   2159.  2.53   1.62
## # ... with 5,154 more rows
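
The TE columns in the table above appear to be the ratio RPF/RNA (and RPF_noN/RNA_noN for TE_noN). This is inferred from the printed values, not stated in the source. A minimal Python sketch checking it on the first three rows; the function name is hypothetical:

```python
# Values copied from the first three rows of the expression table above.
rows = [
    {"Gene": "SPCC1223.02", "RNA": 5846.0, "RPF": 20301.0},
    {"Gene": "SPBC32F12.11", "RNA": 4811.0, "RPF": 19320.0},
    {"Gene": "SPAC26F1.06", "RNA": 4038.0, "RPF": 15032.0},
]


def translation_efficiency(rpf, rna):
    """Ribosome footprints per unit of mRNA; None when RNA is zero."""
    return rpf / rna if rna > 0 else None


# Assumption: TE = RPF / RNA, rounded to two decimals for comparison
# with the printed TE column.
te = {r["Gene"]: round(translation_efficiency(r["RPF"], r["RNA"]), 2) for r in rows}
print(te)
```

The rounded ratios match the printed TE values for these rows (3.47, 4.02, 3.72).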

Check replicability

Ribosome occupancy mostly tracks RNA abundance

ATG Context

Load context data

## # A tibble: 5,118 x 19
##    Gene  aATG.context aATG.pos d1.context d1.posTSS d1.posATG d1.frame
##    <chr> <chr>           <dbl> <chr>          <dbl>     <dbl>    <dbl>
##  1 SPAC… ATTTCTACTGC…        1 AGATATCGC…        60        59        2
##  2 SPAC… ACGATTATAAG…      159 TTTAGTCCG…       266       107        2
##  3 SPAC… CAGTTTTTAGA…       61 CAGCAACTG…        85        24        0
##  4 SPAC… AAAAAAAAAAA…      339 GTAATAAGA…       376        37        1
##  5 SPAC… CTTAGCTATAA…      326 ACCGTCGAA…       371        45        0
##  6 SPAC… TTTCAATCCAA…      140 CCATTCCTC…       336       196        1
##  7 SPAC… AACTAATTCAA…      199 CTGCAGAGT…       236        37        1
##  8 SPAC… AAGTAGGAAAG…       72 GCAAAAAAC…       176       104        2
##  9 SPAC… TTTCCATCCAA…        1 AACAGCCCG…        29        28        1
## 10 SPAC… TCTTGTTAAAT…      315 TTAGAAATA…       364        49        1
## # ... with 5,108 more rows, and 12 more variables: d2.context <chr>,
## #   d2.posTSS <dbl>, d2.posATG <dbl>, d2.frame <dbl>, u1.context <chr>,
## #   u1.posTSS <dbl>, u1.posATG <dbl>, u1.frame <dbl>, u2.context <chr>,
## #   u2.posTSS <dbl>, u2.posATG <dbl>, u2.frame <dbl>
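
In the rows printed above, d1.posATG equals d1.posTSS minus aATG.pos, and d1.frame equals d1.posATG modulo 3 (so frame 0 means in frame with the annotated ATG). This relationship is read off the printed values and is an assumption about how the columns were built. A small Python sketch with a hypothetical function name:

```python
def downstream_atg_offsets(a_atg_pos, d1_pos_tss):
    """Return (distance from annotated ATG, reading frame) for a downstream ATG.

    Both inputs are positions relative to the TSS, as in the table above.
    Frame 0 means the downstream ATG is in frame with the annotated ATG.
    """
    pos_atg = d1_pos_tss - a_atg_pos
    frame = pos_atg % 3
    return pos_atg, frame


# (aATG.pos, d1.posTSS) copied from rows 1-3 of the context table.
examples = [(1, 60), (159, 266), (61, 85)]
results = [downstream_atg_offsets(a, d) for a, d in examples]
print(results)
```

These reproduce the printed d1.posATG and d1.frame values for the three rows: (59, 2), (107, 2), (24, 0).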

Some genes are not in both the context and ribosome profiling data

## [1] "old_SPAC1556.06.2%2CSPAC1556.06" "old_SPAC2F3.13c%2CSPAC2F3"
##  [1] "SPAC14C4.09"   "SPAC1556.06.1" "SPAC1556.06.2" "SPAC212.05c"  
##  [5] "SPAC212.07c"   "SPAC212.09c"   "SPAC212.10"    "SPAC23A1.20"  
##  [9] "SPAC23D3.05c"  "SPAC2E12.05"   "SPAC2F3.12c"   "SPAC2F3.13c"  
## [13] "SPAC750.08c"   "SPAC977.01"    "SPAC977.13c"   "SPAPB24D3.05c"
## [17] "SPBC1348.11"   "SPBC1706.02c"  "SPBC18E5.15"   "SPBC1E8.04"   
## [21] "SPBC31A8.02"   "SPBC460.01c"   "SPBC460.02c"   "SPBC460.03"   
## [25] "SPBC460.04c"   "SPBC460.05"    "SPBCPT2R1.05c" "SPBCPT2R1.06c"
## [29] "SPBCPT2R1.07c" "SPBCPT2R1.10"  "SPBPB10D8.03"  "SPBPB21E7.06" 
## [33] "SPBPB21E7.08"  "SPCC132.05c"   "SPCC1450.01c"  "SPCC1494.11c" 
## [37] "SPCC188.10c"   "SPCC18B5.02c"  "SPCC548.02c"   "SPCC576.16c"  
## [41] "SPCC622.17"    "SPCC663.07c"   "SPCC830.02"    "SPCP20C8.03"  
## [45] "SPMTR.01"      "SPMTR.02"      "SPMTR.03"      "SPMTR.04"
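
The two lists above look like set differences between the gene IDs of the two tables: 2 IDs only in the context table and 48 only in the expression table. That reading is an assumption, but the printed row counts are consistent with it, since 5,118 - 2 = 5,164 - 48 = 5,116 shared genes. A Python sketch with toy ID sets (the set memberships are illustrative, not the real tables):

```python
# Toy ID sets; which table each ID belongs to is illustrative only.
context_genes = {"geneA", "geneB", "old_geneX"}
expression_genes = {"geneA", "geneB", "geneM1", "geneM2"}

only_in_context = sorted(context_genes - expression_genes)
only_in_expression = sorted(expression_genes - context_genes)
shared = sorted(context_genes & expression_genes)

print(only_in_context, only_in_expression, shared)

# Consistency check on the counts printed in this document.
n_context, n_expression = 5118, 5164
n_only_context, n_only_expression = 2, 48
n_shared = n_context - n_only_context
```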

Annotated ATGs have a Kozak consensus sequence

Highly translated Annotated ATGs have a Kozak consensus sequence

Here hiTrans denotes the top 5% (256) of genes ranked by RPF TPM.
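
A minimal Python sketch of picking the top 5% of genes by RPF, assuming the cutoff is the gene count times 0.05, rounded. The rounding rule and the function name are assumptions; the toy values are synthetic.

```python
def top_fraction(values_by_gene, fraction=0.05):
    """Return the genes with the highest values, keeping `fraction` of them."""
    n_top = round(len(values_by_gene) * fraction)
    ranked = sorted(values_by_gene, key=values_by_gene.get, reverse=True)
    return ranked[:n_top]


# Synthetic RPF values for 40 hypothetical genes.
rpf = {f"gene{i}": float(i) for i in range(1, 41)}
hi_trans = top_fraction(rpf)
print(hi_trans)

# With 5,118 genes, 5% rounds to the 256 quoted above.
n_hi = round(5118 * 0.05)
```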

Cytoplasmic Ribosome Annotated ATGs have a Kozak consensus sequence

Most cytoribo genes are highly translated (in the top 5%)

Venn diagram.

There are many duplicate cytoRibo genes in Sp.

Downstream ATGs don’t have a consensus

First downstream ATG

Downstream ATGs in frame and highly translated don’t have a consensus

Except for 3rd-codon-position bias.

Calculate Information content and scores of consensus motif

Calculate a wide and a narrow consensus sequence

Calculate motif scores against the position weight matrix (PWM) for both the narrow (from -4 through the ATG) and wide (from -10 through the ATG) Kozak consensus motifs. These motifs are built from the top 5% most highly translated genes.
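
The first step towards a PWM is a per-position base frequency matrix over the aligned start contexts. A Python sketch on toy narrow contexts (4 nt upstream plus ATG); the sequences, the pseudocount argument and the function name are illustrative assumptions, and the analysis itself uses Biostrings in R:

```python
BASES = "ACGT"


def position_frequency_matrix(seqs, pseudocount=0.0):
    """Per-position base frequencies for equal-length aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in seqs:
            counts[s[i]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix


# Toy narrow contexts: positions -4..-1 followed by ATG.
toy_contexts = ["AAAAATG", "CAAAATG", "AACAATG", "TAAAATG"]
pfm = position_frequency_matrix(toy_contexts)
print(pfm[0])
```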

Estimate the information content

Information content is computed as for a sequence logo; details at https://en.wikipedia.org/wiki/Sequence_logo

## # A tibble: 6 x 4
##   Genes    ATG   Width   Infon
##   <chr>    <chr> <chr>   <dbl>
## 1 All      aATG  narrow 0.826 
## 2 HiTrans  aATG  narrow 2.47  
## 3 CytoRibo aATG  narrow 2.92  
## 4 All      d1ATG narrow 0.182 
## 5 HiTrans  d1ATG narrow 0.0885
## 6 CytoRibo d1ATG narrow 0.106

Information content, in bits, of the highly translated consensus (excluding the 6 bits contributed by the ATG itself): 0.72 for the narrow motif and 8.21 for the wide motif.
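
For DNA, the sequence-logo information content at one position is 2 bits minus the Shannon entropy of the base frequencies, so each invariant position contributes 2 bits and the ATG contributes 6. A Python sketch without the small-sample correction; whether the analysis applied that correction is not stated, and the function name is hypothetical:

```python
import math


def information_content(column):
    """Bits of information at one position: 2 - Shannon entropy (DNA)."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


# Invariant A, T, G columns, as at the start codon itself.
atg_columns = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
atg_bits = sum(information_content(c) for c in atg_columns)

uniform_bits = information_content({b: 0.25 for b in "ACGT"})
two_base_bits = information_content({"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0})
print(atg_bits, uniform_bits, two_base_bits)
```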

Estimate information content per base across start context

Calculate scores of aATG, dATG, uATG against Kozak consensus

We calculate scores using Biostrings::PWMscoreStartingAt.

The best description I could find of this method is: https://support.bioconductor.org/p/61520/

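The score is the sum of the PWM entries selected by the base at each position of the sequence window. Equivalently, it is the sum over the elementwise product of the PWM with the one-hot-encoded sequence.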
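
A Python sketch of that scoring rule on a toy 3-column PWM. This illustrates the sum-of-selected-entries idea only; it is not the Biostrings implementation, and the matrix values are made up. The scores printed in this document top out at 1.000, which suggests the real PWM is scaled so the best possible sequence scores 1, but that is an inference.

```python
def pwm_score(pwm, seq, start=0):
    """Sum the PWM entry for the observed base at each motif position.

    `pwm` is a list of {base: weight} columns; scoring begins at
    0-based offset `start` in `seq`.
    """
    return sum(column[seq[start + i]] for i, column in enumerate(pwm))


# Toy PWM scaled so that the best match (ATG) scores exactly 1.
toy_pwm = [
    {"A": 0.5, "C": 0.0, "G": 0.25, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.25},
    {"A": 0.0, "C": 0.0, "G": 0.25, "T": 0.0},
]

best = pwm_score(toy_pwm, "ATG")
weaker = pwm_score(toy_pwm, "GTG")
offset = pwm_score(toy_pwm, "CCATG", start=2)
print(best, weaker, offset)
```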

Write scores to file scores_kozak_Sp.txt.

## # A tibble: 5,118 x 11
##    Gene  aATG.scorekn d1.scorekn u1.scorekn aATG.scorekw d1.scorekw
##    <chr>        <dbl>      <dbl>      <dbl>        <dbl>      <dbl>
##  1 SPAC…        0.816      0.756     NA            0.810      0.742
##  2 SPAC…        0.867      0.799      0.957        0.849      0.767
##  3 SPAC…        0.920      0.748     NA            0.891      0.722
##  4 SPAC…        0.938      0.887     NA            0.940      0.877
##  5 SPAC…        0.952      0.920      0.748        0.907      0.855
##  6 SPAC…        0.968      0.787     NA            0.944      0.792
##  7 SPAC…        0.948      0.848     NA            0.939      0.767
##  8 SPAC…        0.907      0.890      0.840        0.832      0.900
##  9 SPAC…        0.948      0.851     NA            0.911      0.798
## 10 SPAC…        0.898      0.887     NA            0.868      0.876
## # ... with 5,108 more rows, and 5 more variables: u1.scorekw <dbl>,
## #   d1vsan <dbl>, u1vsan <dbl>, d1vsaw <dbl>, u1vsaw <dbl>

Plot against narrow consensus (-4 to ATG)

Plot against wide consensus (-10 to ATG)

Mutual information of different positions around ATG

For all annotated genes

For highly translated genes

For cytoplasmic ribosomal proteins

What is the -2 correlation in cyto ribo genes?

There are definitely correlations there.
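
Mutual information between two alignment positions can be sketched as below, using plug-in frequency estimates with no small-sample correction. The estimator used for the plots is not stated, so treat this as one plausible version; the toy sequences and function name are hypothetical.

```python
import math
from collections import Counter


def mutual_information(seqs, i, j):
    """Mutual information in bits between positions i and j of aligned sequences."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Perfectly coupled positions carry 1 bit; independent positions carry 0.
coupled = mutual_information(["AA", "CC", "AA", "CC"], 0, 1)
independent = mutual_information(["AA", "AC", "CA", "CC"], 0, 1)
print(coupled, independent)
```

With few sequences (as for the cytoplasmic ribosomal protein set) plug-in estimates are biased upwards, so small values should be read with care.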

uATGs inhibit translation of the main ORF

uATGs are associated with lower absolute translation

uATGs are associated with lower translation efficiency

uATGs associated with lower translation efficiency are over 20nt from TSS

uATG score does not strongly affect TE

We suspect that uATG is associated with lower TE if the uATG has

  • position at least 20nt downstream from TSS
  • higher score

This figure shows that, for genes with only 1 uATG, this correlation is weak and opposite to expected.
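
The comparison can be sketched as: keep genes with exactly one uATG at least 20 nt from the TSS, then correlate uATG score with log2 TE. The field names, the toy values and the choice of Pearson correlation on log2 TE are all assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical genes; none of these values come from the real data.
genes = [
    {"n_uatg": 1, "u1_pos_tss": 35, "u1_score": 0.70, "te": 1.0},
    {"n_uatg": 1, "u1_pos_tss": 60, "u1_score": 0.80, "te": 2.0},
    {"n_uatg": 1, "u1_pos_tss": 22, "u1_score": 0.90, "te": 0.5},
    {"n_uatg": 2, "u1_pos_tss": 40, "u1_score": 0.85, "te": 0.8},
    {"n_uatg": 1, "u1_pos_tss": 5, "u1_score": 0.95, "te": 1.5},
]

selected = [g for g in genes if g["n_uatg"] == 1 and g["u1_pos_tss"] >= 20]
r = pearson_r(
    [g["u1_score"] for g in selected],
    [math.log2(g["te"]) for g in selected],
)
print(len(selected), r)
```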

Compare aATG and dATG context by gene

Most dATG scores are less than aATG scores

For highly translated genes, most dATG narrow scores are much less than aATG

Red: high dATG vs aATG Kozak score. Blue: highly translated. Purple: both.

Small negative correlation between dATG and aATG score narrow

R = -0.04

For highly translated genes, most dATG wide scores are much less than aATG

Red: high dATG vs aATG Kozak score. Blue: highly translated. Purple: both.

Small negative correlation between dATG and aATG score wide

R = -0.043

Genes with unusual dATG vs aATG narrow score

Those genes are in this list:

## # A tibble: 400 x 3
##    Gene          aATG.scorekn d1.scorekn
##    <chr>                <dbl>      <dbl>
##  1 SPCC1620.12c         0.711      1.000
##  2 SPBC9B6.11c          0.688      0.960
##  3 SPBC13G1.01c         0.722      0.991
##  4 SPAC222.04c          0.732      1.000
##  5 SPAC26F1.05          0.724      0.991
##  6 SPBC1289.10c         0.695      0.959
##  7 SPAC1F5.07c          0.738      1.000
##  8 SPAC1B2.06           0.740      1.000
##  9 SPAC12B10.01c        0.732      0.991
## 10 SPAC7D4.13c          0.734      0.991
## # ... with 390 more rows
  • SPAC222.04c/Ies6, Ino80 chromatin remodeling complex subunit
  • SPAC26F1.05/Mug106, meiotically upregulated, short, no known orthologs.
  • SPCC1620.12c GTPase activating protein, distant MDR1/GYP2 homolog
  • SPBC9B6.11c CCR4/nocturnin family endoribonuclease, NGL2/3 homolog
  • SPAC12B10.01c HECT-type ubiquitin-protein ligase E3, UFD4 homolog
  • SPAC1B2.06, short, no known orthologs

Some of these are decently translated (top 25%)

## # A tibble: 32 x 3
##    Gene         aATG.scorekn d1.scorekn
##    <chr>               <dbl>      <dbl>
##  1 SPCC1672.02c        0.738      0.960
##  2 SPBC2G2.12          0.692      0.905
##  3 SPBC1198.08         0.724      0.935
##  4 SPAC3H1.05          0.750      0.952
##  5 SPBC25H2.16c        0.812      1.000
##  6 SPAC1565.08         0.747      0.934
##  7 SPBC1677.03c        0.724      0.908
##  8 SPBC146.13c         0.711      0.887
##  9 SPBC11C11.05        0.764      0.934
## 10 SPBC582.03          0.787      0.957
## # ... with 22 more rows
  • SPCC31H12.08c CCR4-Not complex subunit Ccr4
  • SPBC13G1.01c mitochondrial ribosomal protein subunit S4
  • SPCC188.04c NMS complex subunit Spc25
  • SPBC1198.08 dipeptidase Dug1
  • SPAC1071.01c mRNA cleavage and polyadenylation specificity factor complex subunit Pta1
  • SPAC22F3.06c Lon protease homolog Lon1 (mito localized)
  • SPBC428.01c nucleoporin Nup107
  • SPBC3F6.04c U3 snoRNP protein Nop14

Several of these are involved in mRNA 3’ end regulation. Otherwise the pattern is unclear. We should check whether those ATGs are actually used.

dATG in frame with ATG, narrow

Genes with a high difference in narrow score, filtered for the top 50% by RNA abundance, with the dATG in frame. Saved to dvsaATG_highdiffn_inframe_Sp.txt.

## # A tibble: 30 x 3
##    Gene          aATG.scorekn d1.scorekn
##    <chr>                <dbl>      <dbl>
##  1 SPBC9B6.11c          0.688      0.960
##  2 SPAC12B10.01c        0.732      0.991
##  3 SPAC3G6.04           0.754      0.991
##  4 SPBC428.01c          0.701      0.925
##  5 SPCC1672.02c         0.738      0.960
##  6 SPBC2G2.12           0.692      0.905
##  7 SPBC1198.08          0.724      0.935
##  8 SPBP35G2.14          0.692      0.896
##  9 SPAC3F10.05c         0.732      0.935
## 10 SPAC3H1.05           0.750      0.952
## # ... with 20 more rows
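
A Python sketch of this kind of filter: keep genes at or above the median RNA abundance whose dATG score exceeds the aATG score by more than some threshold, split by frame. The threshold value, the field names and the toy rows are hypothetical; the actual cutoff used for these files is not stated here.

```python
import statistics


def high_diff_genes(rows, min_diff, in_frame=True):
    """Genes in the top half by RNA whose d1 score beats the aATG score.

    `in_frame=True` keeps d1_frame == 0; `in_frame=False` keeps frames 1 and 2.
    """
    rna_median = statistics.median(r["RNA"] for r in rows)
    keep = []
    for r in rows:
        diff = r["d1_score"] - r["a_score"]
        frame_ok = (r["d1_frame"] == 0) == in_frame
        if r["RNA"] >= rna_median and diff > min_diff and frame_ok:
            keep.append(r["Gene"])
    return keep


# Hypothetical rows, not taken from the real tables.
toy_rows = [
    {"Gene": "geneA", "a_score": 0.70, "d1_score": 1.00, "d1_frame": 0, "RNA": 10.0},
    {"Gene": "geneB", "a_score": 0.90, "d1_score": 0.90, "d1_frame": 0, "RNA": 20.0},
    {"Gene": "geneC", "a_score": 0.70, "d1_score": 0.95, "d1_frame": 0, "RNA": 30.0},
    {"Gene": "geneD", "a_score": 0.70, "d1_score": 0.90, "d1_frame": 2, "RNA": 40.0},
]

in_frame_hits = high_diff_genes(toy_rows, min_diff=0.1, in_frame=True)
out_frame_hits = high_diff_genes(toy_rows, min_diff=0.1, in_frame=False)
print(in_frame_hits, out_frame_hits)
```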

dATG out of frame with ATG, narrow

Genes with a high difference in narrow score, filtered for the top 50% by RNA abundance, with the dATG out of frame. Saved to dvsaATG_highdiffn_outframe_Sp.txt.

## # A tibble: 127 x 3
##    Gene          aATG.scorekn d1.scorekn
##    <chr>                <dbl>      <dbl>
##  1 SPBC13G1.01c         0.722      0.991
##  2 SPAC222.04c          0.732      1.000
##  3 SPBC1289.10c         0.695      0.959
##  4 SPAC11E3.01c         0.738      0.991
##  5 SPCC31H12.08c        0.721      0.968
##  6 SPAC1071.01c         0.746      0.991
##  7 SPBC28F2.10c         0.758      1.000
##  8 SPCC663.10           0.750      0.991
##  9 SPCC1739.14          0.714      0.948
## 10 SPBC543.09           0.767      1.000
## # ... with 117 more rows

dATG in frame with ATG, wide

Genes with a high difference in wide score, filtered for the top 50% by RNA abundance, with the dATG in frame. Saved to dvsaATG_highdiffw_inframe_Sp.txt.

## # A tibble: 30 x 3
##    Gene          aATG.scorekw d1.scorekw
##    <chr>                <dbl>      <dbl>
##  1 SPAC12B10.01c        0.729      0.978
##  2 SPBC9B6.11c          0.681      0.928
##  3 SPAC1565.08          0.690      0.907
##  4 SPAC3F10.05c         0.715      0.929
##  5 SPBC1198.08          0.699      0.910
##  6 SPAC3G6.04           0.750      0.958
##  7 SPBC428.01c          0.706      0.912
##  8 SPBC11C11.05         0.714      0.902
##  9 SPBC146.12           0.654      0.841
## 10 SPAC57A7.12          0.740      0.901
## # ... with 20 more rows

dATG out of frame with ATG, wide

Genes with a high difference in wide score, filtered for the top 50% by RNA abundance, with the dATG out of frame. Saved to dvsaATG_highdiffw_outframe_Sp.txt.

## # A tibble: 128 x 3
##    Gene          aATG.scorekw d1.scorekw
##    <chr>                <dbl>      <dbl>
##  1 SPAC222.04c          0.731      0.998
##  2 SPAC11E3.01c         0.708      0.959
##  3 SPBC1703.02          0.678      0.925
##  4 SPBC13G1.01c         0.759      0.967
##  5 SPCC31H12.08c        0.705      0.910
##  6 SPAC29E6.03c         0.717      0.913
##  7 SPAC1071.01c         0.723      0.918
##  8 SPAC10F6.17c         0.733      0.924
##  9 SPCC1739.14          0.728      0.919
## 10 SPBC25H2.16c         0.807      0.990
## # ... with 118 more rows

Compare score difference to localization predictions

Load predictions from mitofates

In input file SPombe_mitofates.txt.

Sp Genes with high dATG vs aATG score are enriched in mitochondrial presequences

Correlations between d1 score, d1 frame, and localization

Count dATG vs ATG score, d1 frame, mito pre, enough RNA

## # A tibble: 16 x 5
## # Groups:   enoughR, d1vsaw0p1, d1.framefac [?]
##    enoughR d1vsaw0p1 d1.framefac Pred_preseq     n
##    <fct>   <fct>     <fct>       <fct>       <int>
##  1 Yes     d1lo      In          No            451
##  2 Yes     d1lo      In          Yes            51
##  3 Yes     d1lo      Out         No           1765
##  4 Yes     d1lo      Out         Yes           132
##  5 Yes     d1hi      In          No             26
##  6 Yes     d1hi      In          Yes             4
##  7 Yes     d1hi      Out         No            103
##  8 Yes     d1hi      Out         Yes            12
##  9 No      d1lo      In          No            403
## 10 No      d1lo      In          Yes            34
## 11 No      d1lo      Out         No           1770
## 12 No      d1lo      Out         Yes           108
## 13 No      d1hi      In          No             31
## 14 No      d1hi      In          Yes             4
## 15 No      d1hi      Out         No            171
## 16 No      d1hi      Out         Yes            11
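
To make the enrichment concrete, the counts above can be collapsed over frame for the genes with enough RNA (enoughR = Yes) and compared between d1hi and d1lo. The Python sketch below computes the presequence fractions and an odds ratio only; it does not test significance, and collapsing over frame is our choice for illustration.

```python
# Counts copied from rows 1-8 of the table above (enoughR == "Yes").
# Keys are (d1vsaw0p1, d1.framefac, Pred_preseq).
counts = {
    ("d1lo", "In", "No"): 451,
    ("d1lo", "In", "Yes"): 51,
    ("d1lo", "Out", "No"): 1765,
    ("d1lo", "Out", "Yes"): 132,
    ("d1hi", "In", "No"): 26,
    ("d1hi", "In", "Yes"): 4,
    ("d1hi", "Out", "No"): 103,
    ("d1hi", "Out", "Yes"): 12,
}


def preseq_totals(group):
    """(with presequence, without presequence) for one score group, summed over frame."""
    yes = sum(n for (g, _, p), n in counts.items() if g == group and p == "Yes")
    no = sum(n for (g, _, p), n in counts.items() if g == group and p == "No")
    return yes, no


hi_yes, hi_no = preseq_totals("d1hi")
lo_yes, lo_no = preseq_totals("d1lo")

frac_hi = hi_yes / (hi_yes + hi_no)
frac_lo = lo_yes / (lo_yes + lo_no)
odds_ratio = (hi_yes * lo_no) / (hi_no * lo_yes)
print(frac_hi, frac_lo, odds_ratio)
```

On these counts, about 11% of d1hi genes have a predicted presequence against about 7.6% of d1lo genes, an odds ratio of roughly 1.5.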

Sp mito-localized genes also do not have a distinctive aATG context

Although the +5 T is striking.

Genes with unusual dATG vs aATG narrow score AND predicted mito loc

Those genes are in this list:

## # A tibble: 31 x 9
##    Gene  aATG.scorekn d1.scorekn d1.frame d1.posATG   RNA    RPF RNA_noN
##    <chr>        <dbl>      <dbl>    <dbl>     <dbl> <dbl>  <dbl>   <dbl>
##  1 SPBC…        0.722      0.991        2       128 144.   48.5     92.0
##  2 SPCC…        0.748      1.000        1        31  69.7   2.57    53.9
##  3 SPAP…        0.701      0.951        2        80  85.9  12.3     37.4
##  4 SPBC…        0.767      1.000        1       274 196.   51.7    215. 
##  5 SPBC…        0.778      1.000        0       216  50.3  19.1     44.5
##  6 SPBC…        0.769      0.991        1        10 232.   30.6    121. 
##  7 SPCC…        0.769      0.991        2        56  66.0   2.76    78.3
##  8 SPBC…        0.701      0.917        2        62 134.   43.6    118. 
##  9 SPBC…        0.692      0.905        0        75 264.  173.     118. 
## 10 SPBC…        0.757      0.966        0        24  92.7  26.9     90.6
## 11 SPAC…        0.748      0.952        2       122 311.   44.3    355. 
## 12 SPAC…        0.715      0.912        2        71  53.9   3.03    62.8
## 13 SPBC…        0.758      0.951        2        86  64.7   3.47    21.6
## 14 SPBC…        0.724      0.908        2        95 394.  149.     275. 
## 15 SPBC…        0.711      0.887        1        73 280.   88.4    466. 
## 16 SPAC…        0.831      1.000        1        49  90.6  23.1    112. 
## 17 SPAC…        0.732      0.900        0         3 331.  272.     694. 
## 18 SPBC…        0.722      0.890        0        39 117.   41.2    111. 
## 19 SPBC…        0.743      0.908        2        68 184.   73.5    225. 
## 20 SPBC…        0.722      0.881        2         5  80.4   4.74    57.3
## 21 SPAC…        0.790      0.948        2       104 112.   33.8     98.5
## 22 SPAP…        0.764      0.920        2        77  38.2  12.3     93.8
## 23 SPBC…        0.755      0.908        2        68  84.9  18.4     88.3
## 24 SPAC…        0.707      0.859        1        73  90.1   9.58    92.5
## 25 SPCC…        0.780      0.930        2        41  69.8   2.94    24.9
## 26 SPAC…        0.688      0.837        0        42  58.5  11.4     43.0
## # ... with 5 more rows, and 1 more variable: RPF_noN <dbl>
  • SPBC13G1.01c mitochondrial ribosomal protein subunit S4
  • SPBC9B6.11c CCR4/nocturnin family endoribonuclease, NGL2/3 homolog
  • SPAPB1A10.11c mitochondrial glutamyl-tRNA ligase Mse1
  • SPCC1183.04c mitochondrial RNA metabolism pathway protein Pet127
  • SPAC22F3.06c Lon1
  • SPBC1347.07 RNA exonuclease Rex2, processes RNA in nuc and mito, multiple differentially localized paralogs in Scer.
  • SPBC3F6.04c U3 snoRNP protein Nop14
  • SPBC16E9.06c BolA domain UV induced protein Uvi31
  • SPBC146.12 monooxygenase Coq6
  • SPBC3D6.03c mitochondrial 3’-tRNA processing endonuclease tRNAse Z, Trz2, mito ortholog of Trz1, and Trz1 is dual-localized in Scer.
  • SPBC16A3.14 superoxide dismutase, mitochondrial ribosomal protein subunit
  • SPBC543.09 mitochondrial m-AAA protease Yta12
  • SPBC2G2.12 mitochondrial and cytoplasmic histidine-tRNA ligase Hrs1
  • SPBC1539.01c mitochondrial ribosomal protein subunit L15 Mrp15
  • SPAC4A8.03c protein phosphatase 2C Ptc4
  • SPBC1677.03c threonine ammonia-lyase Tda1. Scer paralogs ILV1 (mito) and SRY1 (loc not known)
  • SPBC146.13c myosin type I
  • SPAC1B3.04c mitochondrial elongation factor GTPase Guf1
  • SPAC6C3.04 citrate synthase Cit1. Scer has two paralogs, Cit1 (mito) and Cit2 (cyto/peroxisomal). But here the 2nd ATG is right next to the first ATG, so that’s probably just an annotation that is off by one codon.
  • SPBC15C4.02 ABC1 kinase family protein, implicated in mito activity, Mcp2 homolog

Histidine tRNA Ligase, Hurrah!

Several mitochondrial ribosomal proteins, check these.

Mostly the 2nd ATG is not in frame, though. Restricting to in-frame dATGs gives only Rex2 and HisRS.

Less strong biases, but in-frame dATG

## # A tibble: 94 x 8
##    Gene    aATG.scorekn d1.scorekn d1.posATG   RNA     RPF RNA_noN RPF_noN
##    <chr>          <dbl>      <dbl>     <dbl> <dbl>   <dbl>   <dbl>   <dbl>
##  1 SPBC13…        0.778      1.000       216  50.3 1.91e+1    44.5 1.52e+1
##  2 SPBC2G…        0.692      0.905        75 264.  1.73e+2   118.  6.19e+1
##  3 SPBC16…        0.757      0.966        24  92.7 2.69e+1    90.6 2.46e+1
##  4 SPAC6C…        0.732      0.900         3 331.  2.72e+2   694.  7.98e+2
##  5 SPBC14…        0.722      0.890        39 117.  4.12e+1   111.  4.75e+1
##  6 SPAC3H…        0.688      0.837        42  58.5 1.14e+1    43.0 1.36e+1
##  7 SPBC2G…        0.738      0.870         9 172.  3.27e+1    69.1 7.11e+0
##  8 SPBC60…        0.758      0.868        84  23.6 1.69e+0    22.4 3.39e+0
##  9 SPAC24…        0.786      0.891         3  73.5 9.73e+0    90.9 1.57e+1
## 10 SPAC23…        0.763      0.864        36 176.  1.01e+2   231.  1.76e+2
## 11 SPAC57…        0.819      0.917        18  14.2 3.19e-1    13.7 4.68e-1
## 12 SPBC21…        0.760      0.852         3 208.  1.07e+2   221.  1.95e+2
## 13 SPBC2G…        0.864      0.952       105 794.  1.62e+3  1170.  2.54e+3
## 14 SPBC12…        0.795      0.878        60  97.4 9.91e+1   237.  3.49e+2
## 15 SPBC21…        0.925      1.000         9 259.  8.78e+1   335.  1.90e+2
## 16 SPAC4F…        0.917      0.991        60 169.  3.94e+1   162.  3.60e+1
## 17 SPCC16…        0.731      0.790        54  66.4 3.86e+2    72.9 7.99e+2
## 18 SPBC88…        0.779      0.838        60 136.  1.44e+1   148.  1.33e+1
## 19 SPBC16…        0.935      0.991        42 375.  7.17e+1   322.  4.96e+1
## 20 SPBP8B…        0.824      0.878        21  90.8 2.63e+1    51.1 1.78e+1
## # ... with 74 more rows
  • SPBC2G2.08 ade9 C-1-tetrahydrofolate synthase etc.; Scer has MIS1 (mito) and ADE3 (cyto) homologs. Predicted cleavage site is 12 amino acids in.
  • SPBC21C3.04c mitochondrial ribosomal protein subunit L34, but ATGATG
  • SPAC3H1.10 phytochelatin synthetase
  • SPAC4F8.02c mitochondrial ribosomal protein subunit L40
  • SPAC23C11.17 mdm28 mitochondrial inner membrane protein potassium transport
  • SPAC26F1.14c apoptosis-inducing factor homolog Aif1, orthologs dual-localized? No Scer homolog.
  • SPBC1861.05 pseudouridine-metabolizing bifunctional protein.
  • SPBC2G2.04c mitochondrial matrix protein, YjgF family protein Mmf1, Scer has mito (MMF1) and cyto (HMF1) paralogs.
  • SPCC1682.01 qcr9 ubiquinol-cytochrome-c reductase complex subunit 9

Some are dual-localized.

We predict that many more of the dual-localized proteins, such as aa-tRNA synthetases, have non-ATG starts.
